Lecture 1 - RL Basics

  • Characteristics:
    1. There's only a reward signal.
    2. Feedback is delayed, not instantaneous.
    3. RL data is mostly sequential, not i.i.d.; consecutive samples are highly correlated.
  • Reward R(t) is a scalar feedback signal.
  • Indicates how well agent is doing at step t.
  • The agent's job is to maximize cumulative reward.
  • Even if you have multiple goals, you probably need to combine them in some weighting scheme so that the result is a scalar.
  • It's a framework for sequential decision making: actions may have long-term consequences, and you may sacrifice short-term gains for long-term rewards.
  • From the agent's perspective, at time step t the agent receives an observation o(t), receives reward r(t), and executes action a(t). From the environment's perspective, it receives action a(t), emits observation o(t+1), and emits scalar reward r(t+1). See the loop sketch after this list.
  • History is the sequence of observations, actions and rewards.
  • State is the information you decide to keep so that you can decide your next action.
  • Major Components of a RL agent :
    1. Policy: the behaviour function, what you decide to do. Policies can be deterministic or stochastic.
    2. Value Function: how good each state/action is. It is a prediction of future reward, used to evaluate the goodness/badness of states and therefore to select between actions. There are two kinds of value function, state-value and action-value, and future rewards are discounted.
    3. Model: the agent's representation of the environment; a model predicts what the environment will do next.
  • Two fundamental problems in RL: Learning and planning.
  • Prediction: evaluate the future, given a policy.
  • Control: optimize the future, i.e. find the best policy.
  • In order to solve a control problem you need to solve a prediction problem.
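
A minimal sketch of the agent-environment loop and the discounted cumulative reward described above. The `env` and `policy` objects are hypothetical stand-ins with a reset/step interface, not any particular library's API.

```python
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    """Roll out one episode: observe, act, receive reward, accumulate the discounted return."""
    obs = env.reset()                 # hypothetical interface: reset() -> first observation
    ret, discount = 0.0, 1.0
    for t in range(max_steps):
        action = policy(obs)          # the policy maps observation -> action (may be stochastic)
        obs, reward, done = env.step(action)   # hypothetical: step(a) -> (o_{t+1}, r_{t+1}, done)
        ret += discount * reward      # G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
        discount *= gamma
        if done:
            break
    return ret                        # the agent's job is to maximize this (in expectation)
```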

Markov Decision Process

  • Fully observable environment; there are partially observable MDPs as well.
  • A Markov Decision Process is a tuple <S, A, P, R, gamma>.
  • S is a finite set of states, A is a finite set of actions, R is a reward function, and gamma is a discount factor. P is the state-transition probability: the probability of ending up in state s' when you are in state s and take a given action. In some cases the transition is definite, but most often you get a distribution over outcomes; it is not deterministic. A toy example follows this list.
  • A policy is a distribution over actions given states.
  • A policy fully defines the behaviour of an agent
  • We will assume that policies are stationary (time-independent)
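
A toy illustration of the <S, A, P, R, gamma> tuple with a stochastic transition function and a stochastic policy. The two-state MDP below is made up purely for illustration.

```python
import random

# Hypothetical MDP: S = {"s0", "s1"}, A = {"left", "right"}
P = {  # P[s][a] = distribution over next states (transitions are generally stochastic)
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s1": 0.8, "s0": 0.2}},
    "s1": {"left": {"s0": 1.0},            "right": {"s1": 1.0}},
}
R = {"s0": 0.0, "s1": 1.0}   # reward for landing in each state
gamma = 0.9                  # discount factor

pi = {  # policy pi(a|s): a distribution over actions given the state
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample(dist):
    """Sample a key from an {outcome: probability} dict."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs)[0]

def step(s):
    """One step of the agent-environment interaction under policy pi."""
    a = sample(pi[s])
    s_next = sample(P[s][a])
    return s_next, R[s_next]
```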

Model Free RL

  • In most problems the MDP is unknown.
  • Model-free RL can be done through Monte-Carlo learning, which works but takes time: in Monte-Carlo you run to the end of the episode, then take the return and update.
  • Another method is Temporal-Difference learning, a moving-average approach; ultimately a weighted average over all future steps can be used. Both updates are sketched after this list.
  • Directly solve the control problem.
  • Exploration vs Exploitation problem.
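
A rough sketch of the two prediction updates mentioned above. Monte-Carlo waits until the end of the episode and updates toward the full return; TD(0) updates toward a bootstrapped one-step target, like a moving average. The episode format (a list of (state, reward) pairs) is an assumption for illustration.

```python
from collections import defaultdict

V = defaultdict(float)   # state-value estimates

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte-Carlo: run to the end, then update each state toward the full return G."""
    G = 0.0
    for state, reward in reversed(episode):    # episode = [(s_t, r_{t+1}), ...]
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])

def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.99):
    """TD(0): moving-average update toward the bootstrapped target r + gamma * V(s')."""
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])
```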

Lecture 2 - Reinforcement Learning

Papers and reading:

- Sutton & Barto, Reinforcement Learning: An Introduction
- Q-Learning
- Nature DQN paper
- Deep RL Bootcamp, https://sites.google.com/view/deep-rl-bootcamp/lectures
  • Optimal Value Function: the sum of discounted rewards when starting from state s and acting optimally. The policy can be stochastic.
  • Q values: the expected utility of starting in s and taking action a.
  • Sampling-based approximation: because sometimes you can't explore all states.
  • Q-Learning; a tabular sketch with epsilon-greedy exploration follows this list.
  • Exploration and Exploitation.
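
A minimal tabular Q-learning sketch with epsilon-greedy exploration, tying together the Q values, sampling-based approximation, and the exploration/exploitation trade-off. The env interface is again a hypothetical stand-in.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)          # Q[(state, action)] -> estimated expected utility
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability eps, otherwise exploit the current estimate
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # sample-based backup toward the Bellman optimality target
            target = r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```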

Deep RL

  • Use DNNs to represent the value function, policy and model. Optimize the loss function by stochastic gradient descent.
  • DQN in Atari [use a NN to represent the Q function]; a loss sketch follows this list.
  • Gorila (General Reinforcement Learning Architecture).
  • DPN: use a NN to represent the policy; trained with policy gradient.
  • Policy Gradient algorithm; a REINFORCE-style sketch also follows this list.
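
A hedged PyTorch-style sketch of the first two bullets: a neural network represents Q(s, .) and the squared TD error is minimized by stochastic gradient descent. The layer sizes and the frozen target network below follow the Nature DQN recipe only loosely; treat it as an outline, not the paper's implementation.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, .) for 2 actions (toy sizes)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # periodically-synced copy
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    """Squared TD error between Q(s, a) and the bootstrapped target from the frozen network."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # a is a LongTensor of action indices
    with torch.no_grad():
        target = r + gamma * (1.0 - done.float()) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)

# one SGD step on a sampled minibatch (s, a, r, s_next, done are batched tensors):
#   loss = dqn_loss(s, a, r, s_next, done)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```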
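
And a similarly rough REINFORCE-style sketch for the policy bullets: a network outputs action probabilities, and the log-probability of the taken actions is increased in proportion to the return. This illustrates the generic policy-gradient idea, not any specific DPN architecture.

```python
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """REINFORCE: gradient ascent on E[log pi(a|s) * G]."""
    probs = policy_net(states)                                        # [T, num_actions]
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))   # actions: LongTensor
    loss = -(log_probs * returns).mean()                              # minimize the negative objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```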
